feat(huggingFace): add HuggingFaceModelResource for model browsing and media proxy #5124

Open
PG1204 wants to merge 11 commits into apache:main from ELin2025:hf/01-backend-skeleton

@PG1204 PG1204 commented May 17, 2026

What changes were proposed in this PR?

Introduces HuggingFaceModelResource - a Jersey REST resource at /api/huggingface/* that backs the upcoming HuggingFace operator's model picker, audio upload, and media preview UI. Five endpoints:

| Endpoint | Purpose |
| --- | --- |
| GET /api/huggingface/models?task=…[&search=…] | Browse or search HF models |
| GET /api/huggingface/tasks | List HF pipeline tags with hosted inference |
| POST /api/huggingface/upload-audio?filename=… | Stream-upload audio files |
| GET /api/huggingface/audio-preview?path=… | Stream uploaded audio back |
| GET /api/huggingface/media-proxy?url=… | Proxy allowlisted remote media URLs (CORS bypass) |

Plus a single-line registration of the resource in TexeraWebApplication.

Architectural notes:

  • Token sourcing: the user's HF token arrives via the X-HF-Token request header (forwarded by the frontend from the operator's property panel in a follow-up PR). When absent, requests go to HF Hub anonymously. There is no server-side env-var token.
  • Caching: bounded Guava Cache (size + TTL) for /models and /tasks results. User-token requests bypass the cache to avoid serving one user's token-scoped list to another.
  • Streaming upload: /upload-audio reads InputStream straight to disk in 8 KB chunks with a 25 MiB cap (returns 413 on exceedance) - the request body is never buffered in memory. Extension allowlist rejects non-audio types up front.
  • SSRF protection: /media-proxy requires the URL's host to be in an allowlist (HF, fal.media, replicate.delivery/com) with a leading-dot suffix guard against lookalike domains.
  • Bounded fan-out: /tasks uses a dedicated ForkJoinPool(4) for its per-task probe instead of the JVM's global common pool, with explicit 429/503 detection that logs at WARN.
  • Truncation visibility: browse and search responses carry an X-Texera-Truncated: true header when results were capped, so the frontend can show "list incomplete" hints.
  • Error responses: generic Jackson-built JSON bodies (no exception internals leak to clients); details are logged server-side.
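
As an illustration of the SSRF note above, here is a minimal sketch of a host allowlist with a leading-dot suffix guard. It is not the PR's code: the object name, the exact host entries, and the decision to accept both http and https are assumptions made for the example.

```scala
import java.net.URI
import java.util.Locale
import scala.util.Try

object MediaHostAllowlist {
  // Hypothetical entries: the PR names HF, fal.media and replicate.delivery/com,
  // but does not spell out the exact list.
  val allowedHosts: Set[String] =
    Set("huggingface.co", "fal.media", "replicate.delivery", "replicate.com")

  // A host passes if it equals an entry or ends with "." + entry.
  // The leading dot is what rejects lookalikes such as "evilhuggingface.co".
  def isAllowedHost(host: String): Boolean = {
    val h = host.toLowerCase(Locale.ROOT)
    allowedHosts.exists(base => h == base || h.endsWith("." + base))
  }

  // Unparseable URLs, URLs without a host, and non-HTTP schemes are rejected.
  // Accepting plain http here is an assumption of this sketch.
  def isAllowedUrl(url: String): Boolean =
    Try(new URI(url)).toOption.exists { uri =>
      val schemeOk = Option(uri.getScheme)
        .exists(s => s.equalsIgnoreCase("http") || s.equalsIgnoreCase("https"))
      schemeOk && Option(uri.getHost).exists(isAllowedHost)
    }
}
```

With these assumed entries, `https://cdn.fal.media/a.mp4` passes, while `https://huggingface.co.evil.com/` and `http://localhost:8080/` are rejected because neither host equals an entry nor ends with a dot followed by one.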

Any related issues, documentation, or discussions?

Tracked in #5134 and #5041 (the umbrella issue for the HuggingFace operator end-to-end implementation). This PR is the backend foundation; subsequent PRs will add the operator class, frontend property panel, result-panel media rendering, and developer documentation.

Closes #5134

How was this PR tested?

  • Unit tests: amber/src/test/scala/.../HuggingFaceModelResourceSpec.scala - 86 ScalaTest cases covering token sanitization, SSRF allowlist (including lookalike-domain rejection), JSON error escaping, MIME type inference, the audio-upload validation/size-cap/extension paths, audio-preview path validation and traversal rejection, media-proxy rejection paths, cache hit/bypass semantics, and the temp-dir sweep. Run with sbt 'WorkflowExecutionService/testOnly org.apache.texera.web.resource.HuggingFaceModelResourceSpec' - all 86 pass in ~6 seconds, no external network required.
  • Manual smoke tests against a local backend:
    • GET /api/huggingface/tasks returns the expected JSON task list.
    • GET /api/huggingface/models?task=text-generation returns the paginated model list; text-generation shows the X-Texera-Truncated: true header when MAX_PAGES=50 is hit.
    • POST /upload-audio?filename=evil.sh → 400 (extension allowlist).
    • POST /upload-audio with a 30 MiB body → 413 (size cap).
    • GET /media-proxy?url=http://localhost:8080/ → 403 (SSRF allowlist).
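
The 400 and 413 outcomes in the smoke tests above can be modelled with a small decision function. This is only a sketch under assumptions: the object name, the extension list, and the idea of passing the body size as a number are hypothetical (the PR describes counting bytes while streaming, not a size known in advance). Only the 25 MiB cap comes from the PR description.

```scala
import java.util.Locale

object AudioUploadValidation {
  // Hypothetical extension list; the PR only states that non-audio types are rejected.
  val allowedExtensions: Set[String] = Set("wav", "mp3", "flac", "ogg", "m4a")

  // 25 MiB cap, as stated in the PR description.
  val maxAudioBytes: Long = 25L * 1024 * 1024

  // Lower-cased extension, or None when the name has no usable extension
  // (no dot, a leading dot only, or a trailing dot).
  def extensionOf(filename: String): Option[String] = {
    val dot = filename.lastIndexOf('.')
    if (dot <= 0 || dot == filename.length - 1) None
    else Some(filename.substring(dot + 1).toLowerCase(Locale.ROOT))
  }

  // HTTP status this sketch would answer with: the extension check runs first,
  // then the size cap.
  def statusFor(filename: String, bodyBytes: Long): Int =
    if (!extensionOf(filename).exists(allowedExtensions.contains)) 400
    else if (bodyBytes > maxAudioBytes) 413
    else 200
}
```

Checking the extension before looking at the body mirrors the "up front" rejection described in the architectural notes: a request for `evil.sh` is refused without reading any payload.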

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7 in compliance with ASF

…d media proxy

Introduces a new Jersey REST resource exposing endpoints used by the
upcoming HuggingFace operator UI:

- GET  /api/huggingface/models       — browse / search models per task
- GET  /api/huggingface/tasks        — list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio — upload audio for HF audio tasks
- GET  /api/huggingface/audio-preview — stream uploaded audio (path-validated)
- GET  /api/huggingface/media-proxy   — proxy remote media URLs to bypass CORS

This is the first PR in a stacked series landing the HF operator end-to-end.
No operator code yet; this resource is independently useful and lets the
frontend integrate with HF before the operator class lands.

PG1204 commented May 17, 2026

/request-review @Ma77Ball

codecov-commenter commented May 17, 2026

Codecov Report

❌ Patch coverage is 66.85393% with 118 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.85%. Comparing base (953e2c4) to head (2b852ae).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...texera/web/resource/HuggingFaceModelResource.scala | 67.04% | 90 Missing and 27 partials ⚠️ |
| ...a/org/apache/texera/web/TexeraWebApplication.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5124      +/-   ##
============================================
- Coverage     49.16%   47.85%   -1.31%     
- Complexity     2384     2401      +17     
============================================
  Files          1051     1043       -8     
  Lines         40350    40261      -89     
  Branches       4279     4302      +23     
============================================
- Hits          19837    19268     -569     
- Misses        19353    19834     +481     
+ Partials       1160     1159       -1     

| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| access-control-service | 39.41% <ø> (-2.49%) ⬇️ | Carriedforward from 5e95bcd |
| agent-service | 33.76% <ø> (ø) | Carriedforward from 5e95bcd |
| amber | 51.96% <66.85%> (+0.30%) ⬆️ | |
| computing-unit-managing-service | 0.00% <ø> (ø) | Carriedforward from 5e95bcd |
| config-service | 0.00% <ø> (ø) | Carriedforward from 5e95bcd |
| file-service | 32.09% <ø> (-6.33%) ⬇️ | Carriedforward from 5e95bcd |
| frontend | 37.93% <ø> (-3.15%) ⬇️ | Carriedforward from 5e95bcd |
| python | 90.50% <ø> (-0.30%) ⬇️ | Carriedforward from 5e95bcd |
| workflow-compiling-service | 56.81% <ø> (ø) | Carriedforward from 5e95bcd |

*This pull request uses carry forward flags. Click here to find out more.

@Yicong-Huang

@PG1204 Thanks for opening this PR! Please do the following:

  1. Please follow our PR template and keep the description concise.
  2. Please make sure your code meets the test coverage requirements.
  3. Please use issues to describe future plans such as stacked PRs, because each PR becomes immutable once it is merged. Issues can hold information that outlives a PR's life cycle and can be updated over time. If you plan to open multiple PRs, I suggest using an umbrella issue that contains multiple sub-issues, one per PR.
  4. You can use /request-review @xxx to request a review from a reviewer.

PG1204 commented May 18, 2026

@Yicong-Huang

Thank you for the suggestions. Will update the PR accordingly.

@Ma77Ball

Hi @PG1204, while I begin my review, please address @Yicong-Huang's feedback. Specifically:

  1. Update the PR description to follow this template exactly:
   ### What changes were proposed in this PR?
   ...
   ### Any related issues, documentation, or discussions?
   ...
   ### How was this PR tested?
   ...
   ### Was this PR authored or co-authored using generative AI tooling?
   ...
  2. Add test coverage for as much of the new code as possible. At a minimum, please cover the main features and call paths introduced here.
  3. Relocate the overall PR plan to the parent issue, and keep this PR's description scoped to the code changes it actually contains.
  4. Document any architectural changes. If this PR modifies the architecture, please describe what changed and where, so reviewers can follow the design intent.

Thanks, and looking forward to the updates!

@Ma77Ball Ma77Ball left a comment

Please review and resolve the comments and ask any questions as needed.

PG1204 commented May 20, 2026

/request-review @Ma77Ball requesting re-review for the changes.

@Ma77Ball Ma77Ball left a comment

LGTM!

Note

Suggestions above that were not resolved should be resolved in the upcoming PRs. Also, test cases should be added in future PRs to address the missing lines reported by codecov.

@Ma77Ball

/request-review @xuang7

@xuang7 xuang7 left a comment

The PR looks good overall. I left two comments. Please also resolve any existing comments if they can be addressed in this PR, and mark them as resolved.

PG1204 and others added 2 commits May 28, 2026 07:01
Addresses xuang7's review on PR apache#5124 — both endpoints previously
buffered the full payload into a heap-resident byte[] with no upper
bound, leaving the JVM open to OOM on a hostile or buggy upstream
response (/media-proxy) or out-of-band write into the audio temp dir
(/audio-preview).

- /media-proxy: switch from Unirest.asBytes() to
  asObject(Function<RawResponse, T>), streaming the upstream body in
  8 KiB chunks with a running byte counter. Aborts with 413 if the
  declared Content-Length exceeds the cap (pre-check) or if the body
  crosses the cap mid-read (defends against missing/lying
  Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF
  inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with
  headroom.
- /audio-preview: add Files.size() defense-in-depth check before
  readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on
  ingest; this catches the case where a bug or out-of-band write puts
  an oversized file in the temp dir.

Adds a spec covering the audio-preview cap using a sparse-file fixture
so the test stays fast (87/87 spec passes). The media-proxy cap path
is exercised via the existing input-validation suite plus the new
streamMediaWithCap helper - a follow-up can add a fake-RawResponse
unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
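
The chunked read with a running byte counter described in this commit message can be sketched as follows. This is not the PR's streamMediaWithCap helper: the object, method, and exception names are hypothetical, and only the 8 KiB chunk size, the Content-Length pre-check, and the mid-read cap check are taken from the description above.

```scala
import java.io.{InputStream, OutputStream}

object CappedCopy {
  // Hypothetical exception type; a caller could translate it into a 413 response.
  final class PayloadTooLargeException(cap: Long)
      extends RuntimeException(s"payload exceeds $cap bytes")

  // Copies `in` to `out` in 8 KiB chunks and returns the number of bytes copied.
  // Fails fast when a declared length already exceeds the cap, and fails mid-read
  // when the running total crosses it (covers a missing or wrong declared length).
  def copyWithCap(
      in: InputStream,
      out: OutputStream,
      cap: Long,
      declaredLength: Option[Long] = None
  ): Long = {
    if (declaredLength.exists(_ > cap)) throw new PayloadTooLargeException(cap)
    val buffer = new Array[Byte](8 * 1024)
    var total = 0L
    var n = in.read(buffer)
    while (n != -1) {
      total += n
      // Check before writing so nothing beyond the cap reaches the output.
      if (total > cap) throw new PayloadTooLargeException(cap)
      out.write(buffer, 0, n)
      n = in.read(buffer)
    }
    total
  }
}
```

Because only one 8 KiB buffer is held at a time, heap use stays constant regardless of the payload size. The declared-length pre-check is an optimization only; the mid-read check is what enforces the cap.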
PG1204 added a commit to ELin2025/texera that referenced this pull request May 28, 2026
…eration

Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the
team's feature branch into a dispatcher + per-task codegen architecture
and ships the first task family (text-generation) end-to-end.

- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table /
  _parse_response infrastructure with two holes for the per-task payload
  and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and
  the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines)
  holding @JsonProperty fields and the registeredCodegens map

User-input string fields are typed as EncodableString and emitted via
the pyb"..." macro so values reach Python as
self.decode_python_template('<base64>') rather than raw literals; class
constants are assigned in open(self) so self is in scope for the decode
call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN
check at runtime before any HF URL is composed.

PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST
resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking
task families by registering new *Codegen objects in the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PG1204 commented May 28, 2026

@Ma77Ball Would you prefer that I resolve the conversations, or would you rather resolve them yourself? If any of the comments still require work, I will address them and update the PR.

@xuang7 xuang7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!



Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with
@RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five
endpoints require an authenticated user. The annotation isn't enforced
yet — that's coming with the auth-enforcement PR @Yicong-Huang and
@Ma77Ball are working on — but adding it now means no follow-up
change is needed when enforcement lands, and it matches the convention
used by UserConfigResource / AdminSettingsResource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PG1204 added a commit to ELin2025/texera that referenced this pull request May 29, 2026
…eration

Successfully merging this pull request may close these issues.

Add HuggingFaceModelResource REST endpoints for HF operator UI

6 participants